Abstract

A fundamental problem in citation analysis is the prediction of the long-term citation impact of recent publications.
We propose a model to predict a probability distribution for the future number of citations of a publication.
Two predictors are used: the impact factor of the journal in which a publication has appeared and the number
of citations a publication has received one year after its appearance. The proposed model is based on quantile
regression. We employ the model to predict the future number of citations of a large set of publications in the
field of physics. Our analysis shows that both predictors (i.e., impact factor and early citations) contribute to the
accurate prediction of long-term citation impact. We also analytically study the behavior of the quantile regression
coefficients for high quantiles of the distribution of citations. This is done by linking the quantile regression
approach to a quantile estimation technique from extreme value theory. Our work provides insight into the influence
of the impact factor and early citations on the long-term citation impact of a publication, and it takes a step
toward a methodology that can be used to assess research institutions based on their most recently published work.

Keywords: citation analysis; citation impact; impact factor; prediction; quantile estimation; quantile regression. 


1 Introduction 


Citation counts are a popular indicator of the impact of scientific publications. In the evaluation of research 
institutions, bibliometric indicators based on the citations received by the publications of an institution often 
play an important role. However, the use of citation-based indicators is problematic when the impact of recent 
publications needs to be determined. One or two years after their appearance, most publications have received only 
a few citations. After one year, there are many publications with just one or two citations or even with no citations 
at all. Some of these publications may receive a lot of citations in later years, while others may attract hardly 
any attention in the future. This makes it difficult to determine the impact of recent publications. Nevertheless, 
research institutions often want their performance to be assessed based on their most recent work (Bornmann,
2013). In this paper, we therefore propose a model for making predictions of the impact that recent publications
will have in the long term.

Our model predicts the long-term citation impact of a publication based on two variables, namely the impact
factor of the journal in which the publication has appeared and the number of early citations the publication has
received. Early citations are defined as citations received in the year in which a publication appeared or in the
year thereafter. The two predictors that we use are easily available, and unlike, for instance, the prediction
approach proposed by D. Wang, Song, and Barabási (2013), they allow predictions to be made fairly soon after
the appearance of a publication. Also, compared with other predictors that could be considered, such as the length 
of the reference list of a publication or the number of authors of a publication, the predictors that we use are 
relatively hard to manipulate. Earlier studies have shown that both impact factors and early citations are important 
predictors of future citations. In the next section, we will provide an overview of these earlier studies and we will 
discuss their relationship with our present work. 

Earlier studies on citation impact prediction have often focused on providing a point estimate of the future 
number of citations of a publication. Given the high degree of uncertainty in citation impact predictions, we 
believe that it is more relevant to know the probability that a publication will receive a certain number of citations 
in the future. We therefore do not predict the average number of citations that a publication is expected to attract in 
the future, but instead we predict a probability distribution for the future number of citations of a publication. To 
predict this probability distribution conditional on a publication’s impact factor and its early citations, we employ 
the technique of quantile regression introduced by Koenker and Bassett (1978).

We also study the relationship between our prediction model based on quantile regression and results from extreme
value theory. To do so, we first use so-called Zenga plots, introduced recently by Cirillo (2013), to establish
that the citation distributions obtained in our analysis have a Pareto tail. This result then enables us to provide
analytical insight into the behavior of the quantile regression coefficients for high quantiles. More specifically, we
are able to link the regression coefficients to an estimator for the tail quantiles of a Pareto distribution developed
in the framework of extreme value theory (Dekkers, Einmahl, & De Haan, 1989).

We use citation data for a large set of publications in the field of physics to test our prediction approach. The 
data is taken from the Web of Science database. 

The paper is organized as follows. First, Section 2 discusses how our research relates to earlier work reported
in the literature. Next, Section 3 describes the data that were used in our analysis. Section 4 then introduces
our model for predicting the long-term citation impact of publications, conditional on impact factors and early
citations. Section 5 presents our empirical results. Sections 5.1 and 5.2 focus on the values obtained for the
parameters of our model. Sections 5.3, 5.4, 5.5, and 5.6 address the fit of this model to the data and the predictive
power of the model. Section 6 studies the relationship between our model and results from extreme value theory.
Section 7 addresses the sensitivity of the parameters of our model to differences between fields of science, focusing
on the fields of biology, chemistry, and physics. Finally, Section 8 concludes the paper.


2 Relation with earlier work 


There is an extensive literature on modeling or predicting the number of citations of a publication based on all 
kinds of variables. An early study in this literature is the work by Peters and Van Raan (1994), who investigate
the determinants of the citation impact of chemical engineering publications. More recent work in this literature
includes M. Wang, Yu, and Yu (2011), M. Wang et al. (2012), Didegah and Thelwall (2013a, 2013b), Bornmann,
Leydesdorff, and Wang (2013), Yu, Yu, Li, and Wang (2014), and Onodera and Yoshikane (in press). Various studies
have also appeared in non-bibliometric journals (e.g., Haslam & Koval, 2010; Lokker, McKibbon, McKinlay,
Wilczynski, & Haynes, 2008; Mingers & Xu, 2010). Recent overviews of the literature on modeling or predicting
citation impact are provided by Didegah and Thelwall (2013a, 2013b) and Onodera and Yoshikane (in press).
Examples of variables that have been found to predict citation impact include the impact factor of the journal in
which a publication has appeared, the type of study (e.g., original research vs. literature review), the number of
pages of a publication, the number of references of a publication, the number of authors, institutions, and
countries in a publication’s address list, and the past performance of these authors, institutions, and countries.

It is important to emphasize that the objective of our work is different from the studies mentioned above. Like 
the above-mentioned studies, our interest is in predicting citation impact. However, our more specific interest is in 
using citation impact predictions in the evaluation of researchers, research groups, research institutions, and so on. 
In this specific context, many of the variables that have been found to correlate with citation impact should not be 
used for making citation impact predictions. Some variables have the problem that they can be easily manipulated. 
For instance, suppose researchers know that they will be evaluated based on the predicted citation impact of their 
publications, and suppose researchers also know that the citation impact of a publication will be predicted based 
on, for instance, the number of pages or the number of references of the publication. In that case, in order to be 
evaluated more favorably, it may be tempting for researchers to try to artificially increase the number of pages or 
the number of references of their publications. Hence, researchers may try to manipulate the variables that are 
used to make citation impact predictions. Other variables have the problem that they may lead to an undesirable 
self-reinforcing effect. For instance, suppose researchers are evaluated based on the predicted citation impact of 
their publications, and suppose the citation impact of a publication is predicted based on the citation impact of the 
earlier work of the authors of the publication. In that case, researchers who were successful in their older work 
will automatically be predicted to be successful also in their more recent work. This creates a self-reinforcing 
effect. Future success is determined by past success. 

In order to avoid problems related to manipulation and self-reinforcing effects, we aim to predict the citation 
impact of a publication based on indicators that are available shortly after the publication’s appearance and that 
can be considered to provide an impression of the value of the publication for the scientific community. Our focus 
is specifically on two indicators, namely the impact factor of the journal in which a publication has appeared 
and the number of citations a publication has received during the first year after its appearance. Other indicators 
that could be used are the number of downloads of a publication (Brody, Harnad, & Carr, 2006), the number
of readers according to a service such as Mendeley (Thelwall & Wilson, in press), and other types of altmetric
indicators (Costas, Zahedi, & Wouters, in press). In this paper, however, our focus is on impact factors and early
citations. 


The use of early citations to predict long-term citation impact has been studied in various papers. Glänzel
(1997), Burrell (2003), Mingers and Burrell (2006), Mingers (2008), and D. Wang et al. (2013) propose
mathematical models that describe how publications accumulate citations over time. Using these models, they predict
the citation impact of a publication in the longer term based on the publication’s short-term citation history. Adams
(2005), Levitt and Thelwall (2011), Bornmann et al. (2013), and J. Wang (2013) present empirical analyses of the
correlation between short-term and long-term citation counts. Based on an analysis of publications from 1993 in
six fields in the physical and life sciences, Adams (2005) concludes that “across reasonably large samples of
research publications (not individual papers) it is possible to use initial citation counts predictively to index emerging
quality relative to the field” (p. 579). J. Wang (2013) performs an analysis of all publications from 1980 indexed
in the Web of Science database and reports that the Spearman correlation between short-term citation counts and
citation counts after 31 years “rises from 0.266 in year 1 to 0.756 in year 3, and then slowly reaches 1 in year
31” (p. 866). Studies based on correlations reveal general patterns. Individual publications may of course strongly
deviate from these patterns. Extreme deviations can be observed in the case of so-called ‘sleeping beauties’, which
are publications that are hardly cited for a long time and then suddenly receive a lot of citations (Van Raan, 2004).
The phenomenon of sleeping beauties illustrates the difficulty of making accurate predictions of long-term citation 
impact. 

We are aware of three studies in which a comparison is made between the use of early citations and the use 
of impact factors for predicting longer-term citation impact. Abramo, D’Angelo, and Di Costa (2010) compare
rankings of Italian universities based on citations and based on impact factors. They find that in certain fields, in
particular in mathematics and in computer sciences, the ranking based on impact factors outperforms the ranking
based on early citations in terms of the correlation with the ranking based on longer-term citation impact. Levitt and
Thelwall (2011) propose a combined indicator of the impact of a publication that is obtained by taking a weighted
average of the number of citations of a publication and the impact factor of the journal in which the publication
has appeared. They report that in the case of a citation window of no more than one year the combined indicator
provides a better prediction of the longer-term citation impact of publications in the field of economics than a
straightforward indicator based only on citations. These results are in line with the findings of Stern (2014) for
publications in the fields of economics and political science. Stern (2014) reports that shortly after the appearance
of a publication the combined use of early citations and impact factors yields a better prediction of the longer-term
citation impact of the publication than the use of early citations only.

We have now provided an overview of the literature that is most closely related to the research that we present 
in this paper. To make clear how our research contributes to the literature, let us summarize how our research 
differs from existing work:


• Our interest is in predicting long-term citation impact based exclusively on impact factors and early citations.
As mentioned above, we do not want to use variables that can be easily manipulated or that may cause
self-reinforcing effects. 


• Our interest is in predicting long-term citation impact within one or two years after the appearance of a 
publication. Unlike some earlier studies (Levitt & Thelwall, 2008; D. Wang et al., 2013; M. Wang et al.,
2012, 2011), we do not want to wait for five or more years before making predictions.


• Earlier work has shown that predicting long-term citation impact is a difficult task. Hence, it cannot be 
expected that the future number of citations of a publication can be predicted with a high degree of accuracy. 
Unlike most earlier work, our interest therefore is in predicting a probability distribution for the future 
number of citations of a publication. This probability distribution represents the uncertainty that we have 
about the number of citations a publication will receive in the future. We want this probability distribution 
to be predicted with a high degree of accuracy. We do not aim to provide a point estimate of the future 
number of citations of a publication. 


• Most earlier studies (an exception is Glänzel, 1997) actually consider a simplified version of the problem
of predicting long-term citation impact. For instance, a study may consider the problem of predicting the 
number of citations that publications from 2005 have received by the end of 2014, where the prediction is 
based on information available at the end of 2006. However, how do we know whether the prediction model 
obtained for this problem will also work well when it is applied to a different time period? For instance, 
will the model also work well to predict, based on information available at the end of 2014, the number of 
citations that publications from 2013 will have received by the end of 2022? This essential question is left 
unanswered in most earlier studies, but it will be addressed in our research. 


3 Data 

In this paper, we use the in-house version of the Web of Science database of the Centre for Science and Technology 
Studies of Leiden University. Only publications of the document types ‘article’ and ‘review’ are included in the 
analysis. For counting citations, author self-citations are excluded. 

In order to estimate the coefficients of the regression model presented in Section 4, we use a specific set of
publications. Because citation behavior differs between fields of science, only publications in the field of physics 
that were published in 1984 are included. In order to be included, a publication must belong to at least one of the 
following Web of Science subject categories: Applied Physics, Fluids and Plasma Physics, Atomic, Molecular and 
Chemical Physics, Multidisciplinary Physics, Condensed Matter Physics, Nuclear Physics, Particles and Fields 
Physics, and Mathematical Physics. Our entire data set includes 56,207 publications. 

As already explained, we build a regression model with two predictors: the number of early citations of a
publication and the impact factor of the journal in which a publication was published. In the rest of this paper,
we will refer to these predictors as covariates. The number of early citations is defined as the number of citations
received by a publication in the first year after its appearance. It is denoted by $c_1$. In our data set, $c_1$ is the
number of citations that a publication has received before the end of 1985. Hence, for counting early citations, all
citations received by a publication in 1984 and 1985 are included. The impact factor (IF) of a journal in 1984 is
calculated as the average number of citations that publications published in the journal in 1982 and 1983 received
in 1984. In the calculation of the impact factor of a journal, only publications of the document types ‘article’ and
‘review’ are taken into account, both on the citing side and on the cited side.
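The two covariates defined above can be illustrated with a minimal Python sketch. The data layout (publication years paired with lists of citing years) and the function names are assumptions made for this illustration; they are not the structure of the Web of Science database. The sketch also assumes that the input has already been restricted to articles and reviews and that author self-citations have been removed.

```python
def early_citations(pub_year, citing_years):
    """Count citations received in the publication year or the year after (c_1)."""
    return sum(1 for y in citing_years if pub_year <= y <= pub_year + 1)


def impact_factor(journal_pubs, year):
    """Impact factor of one journal in `year`.

    journal_pubs is a hypothetical list of (pub_year, citing_years) pairs,
    one pair per publication of the journal. The impact factor is the average
    number of citations received in `year` by publications from the two
    preceding years.
    """
    window = [(py, cy) for py, cy in journal_pubs if py in (year - 2, year - 1)]
    if not window:
        raise ValueError("journal has no publications in the two preceding years")
    cites = sum(1 for _, cy in window for y in cy if y == year)
    return cites / len(window)


# Toy journal: two publications in the 1982-1983 window, one from 1984.
journal = [(1982, [1983, 1984, 1984]), (1983, [1984]), (1984, [1984, 1985, 1990])]
print(impact_factor(journal, 1984))                  # citations in 1984 per window publication
print(early_citations(1984, [1984, 1985, 1990]))     # citations in 1984 and 1985
```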

In Section 5.6, we also consider two sets of publications in the field of physics published in 1990 and 2000,
respectively. These sets of publications are used to evaluate the predictive performance of our regression model. In
Section 7, we address the sensitivity of the regression coefficients to the field of science. To this end, we use a set of
publications in the fields of biology and chemistry published in 1984.


4 Regression model for quantile prediction 

As pointed out in Sections 1 and 2, our interest is not in providing a point estimate of the future number of citations
of a publication. Instead, our focus is on predicting a probability distribution for the future number of citations of 
a publication. More specifically, our aim is to predict the quantiles of this probability distribution. 

Formally, the p-th quantile $q(p)$ of a random variable $Y$ with distribution function $F$ is given by

$$q(p) = F^{-1}(p) = \inf\{y : F(y) \geq p\}.$$

Hence, saying that a publication scores at the p-th quantile means that the number of citations of the publication 
is greater than or equal to the number of citations of a proportion of p of all publications. 
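For a finite sample, this definition can be applied to the empirical distribution function. The following Python sketch is one way to do so; the function name and the toy citation counts are illustrative assumptions.

```python
def empirical_quantile(values, p):
    """Smallest observed value y whose empirical distribution function is >= p."""
    ordered = sorted(values)
    n = len(ordered)
    for k, y in enumerate(ordered, start=1):
        # At the k-th smallest observation, at least k of the n values are <= y.
        if k / n >= p:
            return y
    raise ValueError("p must lie in (0, 1]")


citations = [0, 1, 1, 2, 5, 8, 13, 40]      # hypothetical citation counts
print(empirical_quantile(citations, 0.5))   # median under this definition
print(empirical_quantile(citations, 0.9))   # a high quantile
```

Because citation counts are integers with many ties, several quantile levels can map to the same count under this definition.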

Our goal is to predict quantiles for the distribution of the number of citations received by a publication starting 
from the second year after its publication date. For example, in our data set of publications published in 1984, 
we consider quantiles for the number of citations received by a publication between January 1986 and December 
2013. We refer to this as the future number of citations of a publication or, alternatively, as the long-term citation 
impact of a publication. In this section, we propose a model that predicts the quantiles of the long-term citation 
distribution of a publication, conditioned on the impact factor and the number of citations in the first year. 


4.1 Models for quantiles 

Like Ke (2013), we assume that each publication has a fitness factor $\eta$. This fitness factor gives information about
the competitiveness of a publication relative to other publications in obtaining citations. The fitness factor depends
on different factors $\phi_s$ that contribute to the success of a publication. The fitness factor is assumed to be a product
of each of these factors raised to some power $\delta_s$:

$$\eta = \prod_s \phi_s^{\delta_s}. \qquad (1)$$

We predict the quantiles of the distribution of future citations, conditioned on the fitness factor. A higher fitness
factor means that a publication has a higher competitiveness to obtain citations. Therefore one expects that the
quantiles of publications with a higher fitness factor are higher. We assume that the p-th quantile of publications
with fitness factor $\eta$, denoted by $q(p \mid \eta)$, is proportional to $\eta$:

$$q(p \mid \eta) = C_p \eta.$$

Here $C_p$ is a constant independent of $\eta$ for each quantile p.

We consider three definitions of the fitness factor: a definition based only on the impact factor IF, a definition
based only on the number of citations in the first year $c_1$, and a definition based on both IF and $c_1$. For clarity of
notation, the exponents $\delta_1$ and $\delta_2$ from Eq. (1) are relabeled as $\beta$ and $\gamma$. The three definitions of the fitness factor
are summarized in Table 1. The constant $k_0$ is needed to account for publications that have zero citations after one
year. We will discuss our choice for $k_0$ in Section 5.2.

Model        $\eta$                                          Quantile prediction
Only IF      $\eta \propto IF^{\beta}$                       $q(p \mid IF) = C_p IF^{\beta_p}$
Only $c_1$   $\eta \propto (c_1 + k_0)^{\gamma}$             $q(p \mid c_1) = C_p (c_1 + k_0)^{\gamma_p}$
Full model   $\eta \propto IF^{\beta} (c_1 + k_0)^{\gamma}$  $q(p \mid IF, c_1) = C_p IF^{\beta_p} (c_1 + k_0)^{\gamma_p}$

Table 1: Three models studied in this paper.


4.2 Quantile regression 

In the model described in Section 4.1, the logarithm of the quantiles is linear in the logarithm of the covariates.
For example, for the full model in Table 1, we obtain

$$\ln(q(p \mid IF, c_1)) = \gamma_p \ln(c_1 + k_0) + \beta_p \ln(IF) + c_p, \qquad (2)$$

where $c_p = \ln(C_p)$. Because the logarithm is an increasing function, the logarithm of the p-th quantile is equal to
the p-th quantile of the log-transformed citation counts. This means that we can take the logarithm of the number
of citations and then fit Eq. (2) to the quantiles of those values.
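The equivalence between the multiplicative form of the full model and its log-linear form can be checked numerically. The sketch below is illustrative only: the coefficient values are hypothetical and are not the fitted values reported in this paper, and the default of 0.5 for the constant added to the early citation count is the value selected in Section 5.2.

```python
import math


def predict_quantile(impact_factor, c1, C_p, beta_p, gamma_p, k0=0.5):
    """Full model in multiplicative form: C_p * IF^beta_p * (c1 + k0)^gamma_p."""
    return C_p * impact_factor ** beta_p * (c1 + k0) ** gamma_p


def predict_log_quantile(impact_factor, c1, c_p, beta_p, gamma_p, k0=0.5):
    """Same model in log-linear form; requires a strictly positive impact factor."""
    return gamma_p * math.log(c1 + k0) + beta_p * math.log(impact_factor) + c_p


# Hypothetical coefficients for a single quantile level p.
C_p, beta_p, gamma_p = 10.0, 0.5, 0.8
direct = predict_quantile(2.0, 3, C_p, beta_p, gamma_p)
via_logs = math.exp(predict_log_quantile(2.0, 3, math.log(C_p), beta_p, gamma_p))
print(direct, via_logs)   # the two forms agree up to rounding error
```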

Equation (2) is fitted using quantile regression, introduced by Koenker and Bassett (1978). While in standard
least squares regression the sum of squared errors is minimized, quantile regression minimizes a different function.
It solves

$$\min_{\theta_p} \sum_i \rho_p(y_i - x_i^{\top} \theta_p). \qquad (3)$$

Here $y_i$ is the logarithm of the future number of citations of publication $i$, $x_i$ is the vector of log-transformed
covariates corresponding to publication $i$, $\theta_p = [c_p \;\; \beta_p \;\; \gamma_p]$, and the function $\rho_p$ is defined as

$$\rho_p(z) = zp - z \mathbb{1}_{z<0} = \begin{cases} zp & \text{if } z \geq 0, \\ z(p - 1) & \text{if } z < 0. \end{cases}$$

Equation (3) minimizes the difference between p and the fraction of negative residuals (Koenker & Bassett,
1978). Hence, when all values are different, the empirical quantiles for the future number of citations are fitted
precisely. In our case, many publications have the same number of citations, and therefore there can be small
differences between actual and fitted quantiles (for an illustration, see the blue dots in Figure 11 below).
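The link between the loss function of quantile regression and empirical quantiles can be made concrete for the simplest case, a model with a constant only. The Python sketch below is an illustration under that simplifying assumption and is not the fitting procedure used in this paper; the function names and the toy data are hypothetical.

```python
def check_loss(z, p):
    """Quantile regression loss: z*p for z >= 0 and z*(p - 1) for z < 0."""
    return z * p if z >= 0 else z * (p - 1)


def best_constant(ys, p):
    """Constant minimizing the total check loss over the observed values.

    For a constant-only fit the loss is piecewise linear and convex in the
    constant, so a minimizer is always attained at one of the observations.
    """
    candidates = sorted(set(ys))
    return min(candidates, key=lambda c: sum(check_loss(y - c, p) for y in ys))


ys = [1, 2, 3, 4, 100]           # toy data with one extreme value
print(best_constant(ys, 0.5))    # minimizer at the median level
print(best_constant(ys, 0.7))    # minimizer at a higher quantile level
```

In this toy example the minimizers coincide with the sample quantiles at the two levels, which is the property that the regression exploits once covariates are added.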


5 Quantile regression results 

In this section, we apply the regression model from Section 4 to the data described in Section 3.

5.1 Model coefficients 

From a research evaluation perspective, high quantiles of citation distributions are especially important because
interest often focuses on identifying high-impact research. For this reason, in our analysis we consider the 0.50-th
up to the 0.99-th quantile. Figures 1a, 1b, and 1c show the parameters $c_p$, $\beta_p$, and $\gamma_p$ resulting from the quantile
regression. The coefficients are shown for the three different versions of the model listed in Table 1.

The coefficient $c_p$ in Figure 1a is increasing in p for all three versions of the model. This is to be expected,
because the quantiles are nondecreasing in p. We also see that $c_p$ is a convex function. For the higher quantiles,
$c_p$ grows faster in p than for the lower quantiles. This indicates that, for example, the 0.98-th and the 0.99-th
quantile are further away from each other than, say, the 0.60-th and the 0.61-th quantile. In Section 6, we will
explain the behavior of $c_p$ more precisely using quantile estimators for Pareto-tailed distributions.

Figure 1: Quantile regression coefficients for the p-th quantile versus p for the different models. Panels: (a) $c_p$, (b) $\beta_p$, (c) $\gamma_p$.


The coefficients $\beta_p$ and $\gamma_p$ are decreasing in p. Hence, the impact factor and the number of citations in the
first year have less influence on the long-term citation impact of highly cited publications than on the long-term 
citation impact of publications with an average number of citations. 
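The role of the exponents can be made explicit with a short worked equation based on the full model. Consider two publications in the same journal, with early citation counts $c_1$ and $c_1'$. The ratio of their predicted p-th quantiles depends only on the exponent of the early citation term. The numerical values mentioned afterwards are purely illustrative and are not fitted coefficients.

```latex
\frac{q(p \mid IF, c_1')}{q(p \mid IF, c_1)}
  = \frac{C_p \, IF^{\beta_p} (c_1' + k_0)^{\gamma_p}}{C_p \, IF^{\beta_p} (c_1 + k_0)^{\gamma_p}}
  = \left( \frac{c_1' + k_0}{c_1 + k_0} \right)^{\gamma_p}
```

For instance, if $(c_1' + k_0)/(c_1 + k_0) = 2$, an exponent of 1 doubles the predicted quantile, whereas an exponent of 0.5 multiplies it by only about 1.41. A smaller exponent at a higher quantile level therefore means that the same relative difference in early citations translates into a smaller relative difference in the predicted quantile. The same argument applies to the impact factor and its exponent.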


5.2 Influence of $k_0$

The parameter $k_0$ in the models listed in Table 1 is not fitted in the quantile regression. To get an understanding of
the influence of $k_0$ on the regression coefficients, quantile regression is used to obtain the coefficients $c_p$, $\beta_p$, and
$\gamma_p$ for several values of $k_0$. Figure 2 shows the values of the regression coefficients for $k_0 = 0.3$ to
1.5 for the full model. We see that $k_0$ has hardly any influence on $\beta_p$. Also, $k_0$ does not have much influence on
$\gamma_p$ and $c_p$. Essentially, these coefficients increase or decrease by a constant value if $k_0$ changes. We use the value
of $k_0$ that minimizes the sum of the squared differences between p and the fraction of publications with fewer
citations than the predicted p-th quantile. This results in $k_0 = 0.5$.
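The selection criterion described above can be sketched as follows. The sketch assumes that the sum runs over the quantile levels considered and that, for each candidate value of the constant, the fraction of publications below the predicted quantile has already been computed from a fitted model. All numbers in the example are made up for illustration, and the function names are hypothetical.

```python
def calibration_score(coverage_by_p):
    """Sum over quantile levels p of (fraction below predicted quantile - p)^2."""
    return sum((fraction - p) ** 2 for p, fraction in coverage_by_p.items())


def select_k0(coverage_by_k0):
    """Return the candidate constant with the smallest calibration score."""
    return min(coverage_by_k0, key=lambda k0: calibration_score(coverage_by_k0[k0]))


# Hypothetical fractions for three candidate constants and two quantile levels.
coverage_by_k0 = {
    0.3: {0.5: 0.47, 0.9: 0.88},
    0.5: {0.5: 0.50, 0.9: 0.91},
    1.0: {0.5: 0.55, 0.9: 0.93},
}
print(select_k0(coverage_by_k0))
```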



Figure 2: Quantile regression coefficients for the p-th quantile versus p for $k_0$ ranging from 0.3 to 1.5.


5.3 Fit of the models 

We now investigate the fit of the models listed in Table 1 to the data. The fit of the model that uses only the impact
factor is illustrated in Figure 3. For every value of the impact factor, the predicted 0.50-th, 0.80-th, and 0.95-th
quantiles are shown as solid lines and the empirical quantiles are shown as dots. It is clear that the predicted
and empirical quantiles may differ a lot. In a similar manner, the fit of the model that uses only the number of
early citations is illustrated in Figure 4. For the 0.50-th quantile, the empirical and the predicted quantiles almost
overlap. For the 0.80-th and the 0.95-th quantile, the model fits well for publications with a small number of
citations in the first year, but it underestimates the quantiles for publications with a large number of early citations.



Figure 3: Predicted value (solid line) and empirical value (dots) of the 0.50-th, 0.80-th, and 0.95-th quantile
versus IF for the model that uses only IF.

To illustrate the fit of the full model, we plot the predicted quantiles against the empirical quantiles. To do so,
we first create groups of publications. A group consists of publications that all have the same $c_1$ and the same
impact factor, where impact factors have been rounded to halves. Figures 5, 6, and 7 show all groups that include
at least 50 publications. The figures relate to, respectively, the 0.50-th, 0.80-th, and 0.95-th quantile. Each dot in
the figures corresponds to a group of publications with the same IF and $c_1$. The 45-degree lines are shown as
a reference. In the case of a perfect fit, all dots should be located on the 45-degree lines. We see that for many
groups of publications predictions are quite accurate, but there are also quite a few groups for which there is a
large difference between the predicted and the empirical quantile. Taking a closer look at the data, we see that,
naturally, a better fit is obtained for larger groups. In Figure 8, the predicted quantiles are again plotted against
the empirical quantiles, but only groups including at least 500 publications are shown. The 0.50-th, 0.80-th, and
0.95-th quantiles are presented in the same plot. For the 0.50-th and 0.80-th quantile, the fit is excellent. For the
0.95-th quantile, there are more outliers. We note that estimates for high quantiles will be explored further using
quantile estimators for Pareto tails in Section 6.
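The grouping step can be sketched in Python as follows. The tuple layout of the input, the function names, and the rule for rounding ties to halves (upward) are assumptions made for this illustration; the toy call uses a much smaller minimum group size than the thresholds of 50 and 500 used in the figures.

```python
import math
from collections import defaultdict


def round_to_half(x):
    """Round to the nearest multiple of 0.5 (ties rounded up, an assumption)."""
    return math.floor(x * 2 + 0.5) / 2


def group_quantiles(pubs, p, min_size=50):
    """Empirical p-th quantile of future citations per (rounded IF, c1) group.

    pubs is a hypothetical iterable of (impact_factor, c1, future_citations).
    Groups with fewer than min_size publications are dropped.
    """
    groups = defaultdict(list)
    for impact_factor, c1, future in pubs:
        groups[(round_to_half(impact_factor), c1)].append(future)

    result = {}
    for key, values in groups.items():
        if len(values) < min_size:
            continue
        ordered = sorted(values)
        n = len(ordered)
        # Smallest observation whose empirical distribution function is >= p.
        k = next(i for i in range(1, n + 1) if i / n >= p)
        result[key] = ordered[k - 1]
    return result


pubs = [(1.2, 0, 4), (0.9, 0, 1), (1.1, 0, 9), (2.6, 1, 7)]
print(group_quantiles(pubs, 0.5, min_size=3))
```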

5.4 Comparing the fit of the different models 

Figures 9a, 9b, and 9c illustrate the difference in the fit of the three models. Let $f$ be the fraction of publications
with fewer citations than the predicted 0.50-th quantile. For different groups of publications, the figures show
$f - 0.5$. If the model predicts correctly, we expect this value to be close to zero. This is indicated by a green color
in the figures. Red colors represent positive values, indicating that the model overestimates the 0.50-th quantile for
a group of publications. Blue colors correspond to negative values, which means underestimation of the 0.50-th
quantile. Each rectangle in the figures represents a group of publications with the same impact factor and the
same number of citations in the first year. For example, for the model that uses only the impact factor, the model
overestimates the 0.50-th quantile for publications with an impact factor of 2.5 and with 0 citations in the first
year. It underestimates the 0.50-th quantile for publications with an impact factor of 2.5 and with 5 citations in the
first year.



Figure 4: Predicted value (solid line) and empirical value (dots) of the 0.50-th, 0.80-th, and 0.95-th quantile
versus $c_1$ for the model that uses only $c_1$.

Figure 5: Predicted versus empirical 0.50-th quantile for groups with at least 50 publications.

Figure 6: Predicted versus empirical 0.80-th quantile for groups with at least 50 publications.

Figure 7: Predicted versus empirical 0.95-th quantile for groups with at least 50 publications.

Figure 8: Predicted versus empirical 0.50-th, 0.80-th, and 0.95-th quantile for groups with at least 500
publications.


Based on these figures, we see that the model which uses only the impact factor does not predict very well for 
publications with either a small or a large number of citations in the first year. Likewise, the model that uses only 
the number of early citations does not predict very well for publications with either a low or a high impact factor. 
Similar figures can be created for other quantiles instead of the 0.50-th quantile. From these figures we conclude 
that the full model yields more accurate predictions than the other two models. This means that both the impact 
factor and the number of citations in the first year provide important information for predictive purposes, and that 
impact factors and early citations should therefore be used together to obtain accurate predictions. For this reason, 
in the remainder of the results, the full model is used. 

5.5 Predicting the conditional citation distribution 

Figure 9: Let $f$ be the fraction of publications with fewer citations than the predicted 0.50-th quantile. The panels
show $f - 0.5$ for the three models for different $c_1$ and IF: (a) full model, (b) model with only IF, (c) model with
only $c_1$.

Using the quantile regression model, we can predict the entire conditional distribution of the number of citations.
Figures 10a and 10b show the predicted and empirical conditional distribution for publications that have impact
factor zero and zero citations in the first year and for publications that have impact factor one and one citation in the
first year, respectively. It is clear that publications with the same impact factor and the same number of citations in
the first year may have different citation numbers after 30 years. For this reason, predicting the entire conditional
distribution is more valuable than giving a point estimate of the number of citations that publications receive.
Furthermore, the conditional distributions in the two figures are different, which indicates that it is important to
take into account the influence of the impact factor and the number of early citations. The quantile regression
method predicts the conditional distribution quite accurately, especially for the publications with an impact factor
of zero and zero citations in the first year.




(b) IF = 1, $c_1 = 1$


Figure 10: Empirical and predicted conditional distribution function of the number of citations after 30 years. 


5.6 Predictions for later publications 

In this section, we test whether the model fitted based on older publications also predicts well when applied to 
more recent publications. To this end, we first estimate the quantile regression coefficients for publications in the 
field of physics published in 1990. The model is fitted to predict the quantiles of the number of citations that these 
publications have received by the end of 2000. We then use the resulting model to predict the quantiles of the 
number of citations that publications in the field of physics published in 2000 have received by the end of 2010.

The predictive performance of the model is illustrated in Figure 11. We compute the fraction of publications
that have received fewer citations than their predicted p-th quantile. If the model predicts well, this fraction should
be p. In Figure 11, p is plotted against the fraction of publications with fewer citations than their predicted p-th
quantile. Results are shown both for publications from 1990 (which were used to fit the model) and for publications 
from 2000 (which were not used in model fitting). The 45-degree line is included as a reference. In the case of a 
perfect fit, all dots should be located on the 45-degree line. 
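
The check described above can be sketched in a few lines of Python. This is a minimal illustration on synthetic lognormal data, not the data or code used in this paper; the function name `coverage_below` and all numerical values are assumptions made for the example.

```python
import numpy as np


def coverage_below(citations, predicted_quantiles):
    """Fraction of publications with fewer citations than their predicted quantile."""
    citations = np.asarray(citations, dtype=float)
    predicted_quantiles = np.asarray(predicted_quantiles, dtype=float)
    # Broadcasting allows one prediction per publication or a single shared value.
    return float(np.mean(citations < predicted_quantiles))


# Synthetic citation counts (lognormal, purely illustrative).
rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0
citations = rng.lognormal(mean=mu, sigma=sigma, size=20_000)

# Standard normal quantiles for p = 0.1, 0.5, 0.9, used to build the
# exact p-th quantiles of the lognormal distribution.
z_scores = {0.1: -1.2815515655446004, 0.5: 0.0, 0.9: 1.2815515655446004}

coverage = {}
for p, z in z_scores.items():
    exact_quantile = np.exp(mu + sigma * z)
    # A well-calibrated prediction: the coverage should be close to p.
    coverage[p] = coverage_below(citations, exact_quantile)

# A predicted median that is 20% too low mimics an underestimating model:
# fewer than half of the publications fall below it.
coverage_low = coverage_below(citations, 0.8 * np.exp(mu))
```

Plotting `coverage[p]` against p gives the kind of diagnostic described in the text: points on the 45-degree line indicate well-calibrated quantiles, and points below it indicate that the quantiles are underestimated.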

For publications from 1990, the quantiles are predicted almost perfectly, which is to be expected because we
use quantile regression (see Section ??). However, when the model is applied to publications from 2000, we see
that the quantiles are underestimated. For example, only around 43% of the publications from 2000 have received
fewer citations than their predicted 0.50-th quantile. This disappointing result is likely due to structural changes
that have taken place over time and that cause a model fitted to older data to perform poorly when applied to
newer data. In particular, there is a trend to include more and more references in publications, and as a result of
this trend, the average number of citations that publications receive has increased over time (Wallace, Larivière,
& Gingras, 2009). Because the model is fitted based on older publications, which have lower citation counts than
more recent publications, the model underestimates the quantiles for more recent publications.

We want to adjust the predictions of the model for the increase over time in the average number of citations per
publication. To do so, we make predictions based on normalized data. This means that all inputs and outputs of the
model are divided by their average value. For example, the number of citations of a publication in the first year,
$c_1$, is divided by the average value of $c_1$ over all publications published in the same year. Similarly, the number
of citations of a publication after 10 years is divided by the average number of citations after 10 years over all
publications published in the same year. The quantile
regression model is fitted on the normalized data from 1990. The data from 2000 are normalized in the same way,
and the model fitted based on publications from 1990 is used to predict the quantiles for publications from 2000.
The resulting predictions are normalized predictions with respect to the average number of citations after 10 years
over all publications from 2000. Hence, a predicted quantile of, for example, 2 means that the quantile equals twice
the average number of citations after 10 years over all publications from 2000.
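
The normalization step can be illustrated as follows. This is a minimal sketch under the assumption that each publication year is normalized by its own average; the helper names `normalize_by_mean` and `denormalize` and the toy citation counts are hypothetical and do not come from the data set analyzed in this paper.

```python
import numpy as np


def normalize_by_mean(values):
    """Divide every value by the average over the cohort."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    if mean == 0:
        raise ValueError("cannot normalize: the cohort average is zero")
    return values / mean


def denormalize(normalized_quantile, cohort_mean):
    """Convert a normalized quantile back to a citation count."""
    return normalized_quantile * cohort_mean


# Toy citation counts after 10 years for two cohorts (hypothetical values).
# The second cohort has 1.5 times as many citations across the board.
c10_1990 = np.array([0, 2, 4, 10, 24])
c10_2000 = np.array([0, 3, 6, 15, 36])

norm_1990 = normalize_by_mean(c10_1990)
norm_2000 = normalize_by_mean(c10_2000)

# A normalized predicted quantile of 2 corresponds to twice the cohort average.
quantile_in_citations = denormalize(2.0, c10_2000.mean())
```

In this toy example the two cohorts differ only by a uniform increase in citation counts, so their normalized values coincide. This is the situation in which a model fitted on the normalized older cohort transfers directly to the newer one.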

Figure 12 shows that the performance of the model for publications from 2000 has become just as good as for
publications from 1990. As in Figure 11, p is plotted against the fraction of publications with fewer citations
than their predicted p-th quantile. The good performance of the model based on normalized data indicates that the
increasing number of citations received by publications is indeed responsible for the underestimation that can be
observed in Figure 11. The use of normalized data solves this problem.



Figure 11: p versus the fraction of publications with fewer citations than their predicted
p-th quantile.



Figure 12: p versus the fraction of publications with fewer citations than their predicted
p-th quantile. An adjustment has been made for increasing citation counts over time.


6 Tail quantiles


In this section, we take a closer look at high quantiles using a quantile estimation technique from extreme value 
theory. 


6.1 Tail of the citation distribution 


In the literature, considerable attention has been paid to the tail of citation distributions, and in particular to the
question whether this tail follows a Pareto distribution. The possibility of a Pareto tail was already suggested
by De Solla Price (1976). Redner (1998) analyzed a large data set of publications and their citations and observed
that the tail of the distribution of citations over publications can be described by a power law. Clauset, Shalizi, and
Newman (2009) proposed a statistical methodology for testing the presence of power-law behavior in empirical
data. Based on the same data set as Redner (1998), they concluded that for citation distributions a power-law tail
cannot be ruled out. The methodology of Clauset et al. (2009) was also used by Albarrán, Crespo, Ortuño, and
Ruiz-Castillo (2011), who found that in a large number of scientific fields citation distributions seem to have a
power-law tail.

Formally, let X be the number of future citations. The random variable X has a Pareto tail if for some $x_1$ we
have

$$P(X > x) = w x^{-\alpha}, \qquad x > x_1.$$

Here w and $\alpha$ are parameters; $\alpha$ is also called the tail exponent.

There are many ways to test whether a distribution has a Pareto tail. Here we use the Zenga plot proposed
by Cirillo (2013). The motivation behind this method is that it allows distinguishing between a Pareto tail and a
lognormal tail, while many other methods, such as the QQ-plot, fail to detect this difference.

Let F be the distribution function of a random variable X. The Zenga curve Z is defined as 


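$$
\begin{aligned}
Z(u) &= 1 - \frac{Q^-(u)}{Q^+(u)}, & 0 < u < 1, \\
Q^-(u) &= \frac{1}{u} \int_0^u F^{-1}(s)\, ds, & 0 < u < 1, \\
Q^+(u) &= \frac{1}{1-u} \int_u^1 F^{-1}(s)\, ds, & 0 < u < 1.
\end{aligned}
$$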


Hence, the Zenga curve is a measure of how much weight of the distribution lies below the u-th quantile relative to 
how much weight lies above the u-th quantile, as a function of u. The Zenga curve has different shapes for different 
distributions. For the lognormal distribution, the Zenga curve is a straight line, while for Pareto distributions, the 
Zenga curve is a convex increasing function (Cirillo, 2013).

Figures 13 and 14 show the empirical Zenga plot for groups of publications with either the same number of
citations in the first year or the same impact factor. The Zenga curves are clearly convex increasing functions.
This indicates that for all groups of publications the citation distribution has a Pareto tail.
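
An empirical Zenga curve can be computed directly from a sample by replacing $Q^-(u)$ and $Q^+(u)$ with the averages of the observations below and above the u-th sample quantile. The sketch below is one possible implementation, not the code used for Figures 13 and 14. The function name `empirical_zenga` and the synthetic Pareto sample are assumptions for illustration, and the closed form used for comparison, $Z(u) = (1 - (1-u)^{1/\alpha})/u$, follows from the definition above for an exact Pareto distribution with $\alpha > 1$.

```python
import numpy as np


def empirical_zenga(x):
    """Empirical Zenga curve evaluated at u = i/n for i = 1, ..., n-1.

    Assumes the upper-group averages are positive (true whenever the
    largest observation is positive).
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n < 2:
        raise ValueError("need at least two observations")
    csum = np.cumsum(x)
    i = np.arange(1, n)                                # size of the lower group
    lower_mean = csum[i - 1] / i                       # estimate of Q^-(u)
    upper_mean = (csum[-1] - csum[i - 1]) / (n - i)    # estimate of Q^+(u)
    u = i / n
    z = 1.0 - lower_mean / upper_mean
    return u, z


# Synthetic Pareto sample via inverse transform sampling (illustrative only).
rng = np.random.default_rng(1)
alpha = 4.0
sample = (1.0 - rng.random(100_000)) ** (-1.0 / alpha)

u, z = empirical_zenga(sample)

# Zenga curve of an exact Pareto distribution with tail exponent alpha.
z_theory = (1.0 - (1.0 - u) ** (1.0 / alpha)) / u
```

For real citation counts, which are discrete and contain many zeros, the curve is best read away from the extremes of u, where the group averages are based on few observations.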


6.2 Pareto quantile estimator 


Under the assumption of a Pareto tail, we can estimate high quantiles using the estimator proposed by Dekkers et
al. (1989). For a group of publications with given IF and $c_1$, this estimator is of the form

$$q(p \mid IF, c_1) = X_{(n-k,n)} \left( \frac{k}{n(1-p)} \right)^{1/\alpha}. \tag{4}$$

Here $n := n(IF, c_1)$ is the number of publications in the group of publications with given IF and $c_1$, and
$X_{(n-k,n)} := X_{(n-k,n)}(IF, c_1)$ is the number of citations of the k-th most cited publication in this group.
Furthermore, $k := k(IF, c_1)$ is the threshold where the Pareto tail starts, so the k publications with the largest
number of citations follow a Pareto distribution.

Figure 13: Zenga plot for different $c_1$.


Figure 14: Zenga plot for different IF. 


The Pareto tail starts at the $(1 - k/n)$-th quantile. This threshold quantile will be called the $p^*$-th quantile. Thus,
$X_{(n-k,n)}$ is the empirical value of the $p^*$-th quantile. The tail index $\alpha$ is estimated using the Hill estimator (Hill,
1975). The threshold k and hence the threshold quantile $p^*$ are estimated using the procedure suggested by Beirlant,
Glänzel, Carbonez, and Leemans (2007). This procedure minimizes an approximation of the asymptotic mean
squared error of the estimate of $\alpha$. The resulting estimate for $p^*$ is $p^* = 0.95$.
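
The Hill estimate of $\alpha$ and the tail quantile estimator of Eq. (4) can be sketched as follows. This is a minimal sketch rather than the implementation used in this paper. The threshold k is fixed by hand at $p^* = 0.95$ instead of being selected by the procedure of Beirlant et al. (2007), $X_{(n-k,n)}$ is taken to be the $(n-k)$-th smallest observation, and the function names and the synthetic Pareto sample are assumptions.

```python
import numpy as np


def hill_alpha(x, k):
    """Hill estimate of the tail exponent from the k largest observations."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = x[n - k - 1]          # X_(n-k,n), the (n-k)-th smallest value
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    # Mean log-excess of the k largest observations over the threshold.
    # With heavily tied (discrete) data this can be zero; not handled here.
    mean_log_excess = np.mean(np.log(x[n - k:]) - np.log(threshold))
    return 1.0 / mean_log_excess


def pareto_quantile(x, k, p):
    """Tail quantile X_(n-k,n) * (k / (n (1 - p)))^(1/alpha), as in Eq. (4)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if not (1.0 - k / n) < p < 1.0:
        raise ValueError("p must lie above the threshold quantile 1 - k/n")
    alpha = hill_alpha(x, k)
    threshold = x[n - k - 1]
    return threshold * (k / (n * (1.0 - p))) ** (1.0 / alpha)


# Synthetic Pareto sample with known tail exponent (illustrative only).
rng = np.random.default_rng(2)
alpha_true = 2.5
sample = (1.0 - rng.random(50_000)) ** (-1.0 / alpha_true)

k = 2_500                              # corresponds to p* = 1 - k/n = 0.95
alpha_hat = hill_alpha(sample, k)
q98 = pareto_quantile(sample, k, 0.98)
true_q98 = (1.0 - 0.98) ** (-1.0 / alpha_true)
```

On this synthetic sample the estimated quantile can be compared with the exact value `true_q98`. For real citation data no such reference is available, which is why the comparison in the text is made against empirical quantiles.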

Figure 15 plots the quantiles that are predicted by Eq. (4) against the empirical quantiles, when $X_{(n-k,n)}$
is given. Note that the accuracy of the predictions is inconclusive because we rely on $X_{(n-k,n)}$, which is not
known in practice. However, we see that Eq. (4) captures the behavior of high quantiles quite accurately. In the
next section, we will obtain insight into the predictions for high quantiles by linking the Pareto estimator to the
quantile regression estimator.


6.3 Linking the Pareto and regression estimators 

As mentioned above, the problem with Eq. (4) is that it uses $X_{(n-k,n)}$, the empirical value of the $p^*$-th quantile,
at which the Pareto tail starts. In practice, $X_{(n-k,n)}$ is not known and needs to be predicted. A natural way to
overcome this is to replace $X_{(n-k,n)}$ by the predicted $p^*$-th quantile from the quantile regression:

$$q(p^* \mid IF, c_1) = C_{p^*} \, IF^{\beta_{p^*}} (c_1 + k_0)^{\gamma_{p^*}}.$$

We then obtain the following estimator for the tail quantiles:

$$q(p \mid IF, c_1) = \left( \frac{1-p^*}{1-p} \right)^{1/\alpha} C_{p^*} \, IF^{\beta_{p^*}} (c_1 + k_0)^{\gamma_{p^*}}, \qquad p > p^*, \tag{5}$$

where we used the identity $k/n = 1 - p^*$.
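
To make the structure of Eq. (5) concrete, the sketch below evaluates the tail estimator for a single publication. All coefficient values ($C_{p^*}$, $\beta_{p^*}$, $\gamma_{p^*}$, $k_0$, $\alpha$) are hypothetical numbers chosen for illustration and are not estimates from this paper; the function names are likewise assumptions.

```python
import math


def regression_quantile(C, beta, gamma, impact_factor, c1, k0):
    """Quantile predicted by the regression model: C * IF^beta * (c1 + k0)^gamma."""
    return C * impact_factor ** beta * (c1 + k0) ** gamma


def tail_quantile(p, p_star, alpha, C_star, beta_star, gamma_star,
                  impact_factor, c1, k0):
    """Tail quantile of Eq. (5): Pareto extrapolation from the p*-th quantile."""
    if not p_star < p < 1.0:
        raise ValueError("Eq. (5) applies only for p* < p < 1")
    scale = ((1.0 - p_star) / (1.0 - p)) ** (1.0 / alpha)
    return scale * regression_quantile(C_star, beta_star, gamma_star,
                                       impact_factor, c1, k0)


# Hypothetical coefficients at p* = 0.95 and a hypothetical publication.
p_star, alpha = 0.95, 2.0
C_star, beta_star, gamma_star, k0 = 20.0, 0.5, 0.6, 1.0
impact_factor, c1 = 4.0, 3

base = regression_quantile(C_star, beta_star, gamma_star, impact_factor, c1, k0)

# (1 - p*) / (1 - p) = 4 for p = 0.9875, so with alpha = 2 the tail
# quantile is about twice the predicted p*-th quantile.
tail = tail_quantile(0.9875, p_star, alpha, C_star, beta_star, gamma_star,
                     impact_factor, c1, k0)
```

The example shows that, above $p^*$, the dependence on p enters only through the factor $((1-p^*)/(1-p))^{1/\alpha}$, while the dependence on IF and $c_1$ is inherited from the regression at $p^*$.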

We can now explain the behavior of $c_p = \ln(C_p)$ in Figure ?? by comparing Eq. (5) to the regression estimator.
If the two estimators were equal, then for $p > p^*$ we would have

$$\left( \frac{1-p^*}{1-p} \right)^{1/\alpha} C_{p^*} \, IF^{\beta_{p^*}} (c_1 + k_0)^{\gamma_{p^*}} = C_p \, IF^{\beta_p} (c_1 + k_0)^{\gamma_p}.$$

Hence, if we assume that $\beta_p$ and $\gamma_p$ are constant for large values of p, this suggests that we may use
$\left( \frac{1-p^*}{1-p} \right)^{1/\alpha} C_{p^*}$ for $C_p$. By taking the logarithm, we obtain the following proxy
$\tilde{c}_p$ for the regression coefficient $c_p$:

$$\tilde{c}_p = c_{p^*} + \frac{1}{\alpha} \left( \ln(1-p^*) - \ln(1-p) \right), \qquad p > p^*. \tag{6}$$

In Figure 16, $\tilde{c}_p$ and $c_p$ are plotted. The blue line corresponds to $\tilde{c}_p$ given by Eq. (6) with $p^* = 0.95$. For
completeness, we plot this line for all $p \in [0.5, 0.99]$. The red dots correspond to the regression coefficients $c_p$.
We see that Eq. (6) indeed can be used as an analytical description of $c_p$ when $p > p^*$. For $p > 0.95$, there is an
excellent agreement between $\tilde{c}_p$ and $c_p$. In fact, Eq. (6) shows a good agreement for all $p > 0.9$.



Figure 15: Predicted p-th quantile obtained from the Pareto estimator versus empirical
p-th quantile for p = 0.96 and p = 0.98 and for groups with at least 50 publications.



Figure 16: $\tilde{c}_p$ from Eq. (6) (blue line) and $c_p$
from regression (red dots) for p-th quantile.
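
The proxy of Eq. (6) is straightforward to evaluate numerically. The sketch below computes it on a grid of quantile levels; the values of $c_{p^*}$ and $\alpha$ are hypothetical placeholders, not the estimates underlying Figure 16, and the function name is an assumption.

```python
import numpy as np


def proxy_log_coefficient(p, p_star, c_star, alpha):
    """Proxy of Eq. (6): c_{p*} + (ln(1 - p*) - ln(1 - p)) / alpha, for p < 1."""
    p = np.asarray(p, dtype=float)
    return c_star + (np.log(1.0 - p_star) - np.log(1.0 - p)) / alpha


# Hypothetical inputs: regression intercept at p* and tail exponent.
p_star, c_star, alpha = 0.95, 3.0, 2.0

# Quantile levels from 0.5 to 0.99, the range mentioned in the text.
p_grid = np.linspace(0.5, 0.99, 50)
proxy = proxy_log_coefficient(p_grid, p_star, c_star, alpha)

# By construction the proxy equals c_{p*} at p = p*.
proxy_at_threshold = float(proxy_log_coefficient(p_star, p_star, c_star, alpha))
```

The proxy increases with p at a rate governed by $1/\alpha$: a heavier tail (smaller $\alpha$) makes the coefficient for high quantiles grow faster.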


7 Sensitivity of parameters to the field of science 


In the previous sections, data from the field of physics was used to fit the coefficients of the regression model. 
In this section, we study the influence of the field of science on the regression coefficients. We again consider 
publications published in 1984, and the quantile regression is again used to predict the conditional quantiles of 
the distribution of the number of citations received by these publications by the end of 2013. The coefficients $c_p$,
$\beta_p$, and $\gamma_p$ obtained from the quantile regression are plotted in Figures 17a, 17b, and 17c for three different fields:
biology, chemistry, and physics.

The regression coefficient $c_p$ is higher for publications in biology and chemistry than for publications in
physics. Other things being equal, this means that publications in biology and chemistry receive more citations
than publications in physics. This is the case mainly for the lower quantiles. For the higher quantiles, the differences
in the coefficient $c_p$ are small. The regression coefficient $\beta_p$ is lower for biology publications than for
publications in physics and chemistry. This means that in biology the impact factor has less influence on the
long-term citation impact of publications. However, the differences in this coefficient are small. The coefficient
$\gamma_p$ is highest for publications in physics. Hence, in physics the number of citations that a publication has received
in the first year has more influence on the publication’s long-term citation impact than in biology and chemistry.

(a) $c_p$ (b) $\beta_p$ (c) $\gamma_p$

Figure 17: Quantile regression coefficients for the p-th quantile versus p for three fields of science.


8 Conclusions 


We have proposed a model to predict a probability distribution for the future number of citations of a publication. 
Two predictors are considered in the model: the impact factor of the journal in which a publication has appeared
and the number of citations received by a publication in the first year after its appearance. The proposed model
is based on quantile regression. The good fit of the model indicates that quantile regression is a suitable tool to
predict the quantiles of the probability distribution of a publication’s future number of citations. We have found
that the quantile regression coefficients $\beta_p$ and $\gamma_p$, corresponding to the impact factor and the number
of early citations, respectively, vary with the quantile p. Hence, the influence of the impact factor and the number of
early citations on the long-term citation impact of a publication is different for different quantiles.

Three variants of our prediction model have been studied. The variant in which both the impact factor and the 
number of early citations are used turns out to fit the data better than the variants in which only one of the two 
predictors is included. This means that both the impact factor and the number of early citations are important to 
predict the probability distribution of a publication’s future number of citations. 

Importantly, our proposed model also provides accurate predictions for publications that were published later
than the publications used for estimating the model parameters. However, in order to obtain these accurate predictions,
it is necessary to normalize all inputs and outputs of the model by their average value.

We have also investigated the tail of the citation distributions obtained in our analysis. Zenga plots (Cirillo,
2013) have been used for this purpose. It turns out that the tail of our citation distributions can be approximated
by a Pareto distribution. Using an estimator for the tail quantiles of a Pareto distribution (Dekkers et al., 1989),
we have obtained an explicit equation for the regression coefficient $c_p$ in our model for high quantiles p.

There are a number of issues that require further research. First of all, further research may focus on the fitness 
factor that we use in our model. Following Ke (2013), we have assumed that the fitness factor is a product of our
two predictors each raised to a certain power. Other ways of modeling the fitness factor may also be investigated. 

The analysis presented in this paper is based on publications in the field of physics. This is a broad field 
consisting of many different subfields. These subfields probably all have their own citation practices. Differences 
in citation practices between fields or subfields have not been taken into account in our prediction approach. 
Further research may focus on linking our prediction approach to the literature on field normalization of
citation-based indicators.

Another issue for further research is the use of other predictors, in addition to impact factor and early citations.
In Section ??, we already suggested some possibilities: number of downloads of a publication, number of readers
according to a service such as Mendeley, and other types of altmetric indicators. Further research may investigate
the effect of adding these predictors to our model. In particular, it would be interesting to find out whether the use
of additional predictors decreases the level of uncertainty in predictions of long-term citation impact.

Finally, perhaps the most challenging issue for further research is to make predictions of long-term citation
impact not only for individual publications but also for the entire publication oeuvre of a researcher, a research
group, or a research institution (e.g., Acuna, Allesina, & Kording, 2012; Bornmann, 2013). Moving from predictions
at the individual publication level to predictions at the level of oeuvres of publications is far from trivial. A
prediction approach that yields accurate results at the individual publication level may provide biased results when
it is used at the level of oeuvres of publications.